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STOCHASTIC OPTIMAL CONTROL VIA BELLMAN’S 

PRINCIPLE 


Luis G. Crespo* and Jian Q. Surd 


ABSTRACT 

This paper presents a method for finding optimal controls of nonlinear systems subject 
to random excitations. The method is capable to generate global control solutions 
when state and control constraints are present. The solution is global in the sense 
that controls for all initial conditions in a region of the state space are obtained. 

The approach is based on Bellman’s Principle of optimality, the Gaussian closure and 
the Short-time Gaussian approximation. Examples include a system with a state- 
dependent diffusion term, a system in which the infinite hierarchy of moment equations 
cannot be analytically closed, and an impact system with a elastic boundary. The 
uncontrolled and controlled dynamics are studied by creating a Markov chain with 
a control dependent transition probability matrix via the Generalized Cell Mapping 
method. In this fashion, both the transient and stationary controlled responses are 
evaluated. The results show excellent control performances. 

keywords: Stochastic optimal control, Bellman’s principle, Cell mapping, Gaussian 
closure. 

1 INTRODUCTION 

Optimal control of stochastic nonlinear dynamic systems is an active area of research due 
to its relevance to many engineering applications. This is a very difficult problem to study, 
particularly when the system is strongly nonlinear and there are constraints on the states 
and the control. Optimal feedback controls for systems under white-noise random excitations 
may be studied by the Pontryagin maximum principle, Bellman’s principle of optimality and 
the Hamilton-Jacobi-Bellman (HJB) equation. When the control and the state are bounded, 
the direct solution of the HJB equation faces exigent difficulties since it is multidimensional, 
nonlinear and defined in a domain that in general might not be simply connected. The 
theory of viscosity solutions, first introduced by Crandall et. al. [21], provides a convenient 
framework for studying the HJB equation. Very few closed form solutions to this problem 
have been found so far. Dimentberg et al. found the analytical solutions of the optimal 
control of a linear spring mass oscillator with Lagrange and Mayer cost functionals in a 
region of the phase space [2, 8]. 

Given the intrinsic complexity of the problem, we must resort to numerical methods to 
find approximate control solutions [1, 11]. While some numerical methods of solution to 
the HJB equation are known, they usually require knowledge of the boundary/asymptotic 
behavior of the solution in advance [2], Relatively few problems are known today which are 
solved through the use of the HJB equation. Zhu et al. proposed a strategy for optimal 
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feedback control of randomly excited structural systems based on the stochastic averaging 
method for quasi-Hamiltonian systems and the HJB equation [22], 

The cell mapping methods, originally developed by Hsu, have been applied to the optimal 
control problem of deterministic systems [9]. Strategies to solve optimal control problems 
with fixed final time [4] and fixed final state [5] terminal conditions have been developed and 
applied. The Generalized Cell Mapping (GCM) method is a very effective tool for studying 
the global behavior of strongly nonlinear stochastic system. The GCM method is integrated 
with the Short-time Gaussian Approximation (STGA) scheme to provide a very efficient way 
for constructing Markov chains that describe the global dynamics of the system [10, 16, 18]. 
In this paper, these tools are extended to the stochastic control problem. Specifically, we 
use Bellman’s Principle of optimality and the STGA to generate global control solutions to 
stochastic control problems with fixed-state terminal conditions and state and/or control 
constraints. 

The current method, that involves both analytical and numerical steps, offers several 
advantages. It can handle strongly nonlinear and non-smooth systems, can include various 
state and control constraints and leads to global solutions. Former developments on the 
subject can be found in [3], where numerical and analytical solutions were successfully com- 
pared. In this paper several extensions to that work are proposed and evaluated. The content 
of this paper is organized as follows. Section 2 briefly introduces the problem formulation 
and the Bellman’s principle. The analytical and numerical parts of the solution method are 
presented in Section 3. Examples are provided in Section 4, where the following systems 
are controlled: a nonlinear system with a state-dependent diffusion part, a system with 
non-analytically closeablc terms and an impact system with a reflective boundary condition. 
Some conclusions are stated in Section 5. 


2 STOCHASTIC OPTIMAL CONTROL 
2.1 Problem Formulation 


Consider a system governed by stochastic differential equation (SDE) in the Stratonovich 
sense: 


dx(f) = m(x(f), u (t))dt + <r(x(f), u(f))dB(f) 


( 1 ) 


where x (f) G R n is the state vector, u(f) G R m is the control input, B(f) is a vector of 
independent unit Wiener processes and the functions m(-) and <r(-) are in general nonlinear 
functions of their arguments. By using Ito’s formula [14], we obtain the corresponding Ito 
SDE: 


dx(f) = ( m(x, u) + 


1 <9<t(x, u) 

2 &T 


-a x, u 


dt + a(x, u)dB(f) 


( 2 ) 


The associated Fokker-Planck-Kolmogorov (FPK) equation for the conditional probability 
density function (PDF) p(x, t 0 l x o, ffi) is given by: 


dp 

dt 


d_ 

<9x 


P 


^m(x, u) + 


1 <9<r(x, u) 

2 Ihc 


<x(x, u) 



i d 2 
+ 2 fa? 


pcr(x, u)cr(x, u) T 


( 3 ) 


Define the cost functional as: 


J(u,x 0 ,f 0 ,T) 


E 


0(x(T),T)+/ L(x(t),u(t))dt 


't 0 


( 4 ) 
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where [to, T ] is the time interval of interest, 0(x(T), T) is the terminal cost, and L(x(t), u(f)) 
is the Lagrangian function. The optimal control problem is to find a control u(f) within a 
set U C R m on the time interval [t 0 , T\ that drives the system from a given initial condition 
x(f 0 ) = x 0 to the target set defined by \k(x(T), T) = 0 such that the cost J is minimized. 

2.2 Bellman’s Principle of optimality 

Let (x*,u*) be an optimal control solution pair over the time interval [to,T] subject to the 
initial condition x(fo) = Xo. Let t be a time instant such that t® < t < T. Then, (x*,u*) is 
still the optimal control solution pair from [t, T] subject to the initial condition x(t) = x*(t). 

Let V(x. 0 ,t 0 ,T) = J(u* , x 0 , f 0 , T) be the so-called value function or optimal cost function. 
Bellman’s principle of optimality can be stated as [21]: 


V (x 0 , t 0 , T) = inf E 

uGU 


L(x.(t),u(t))dt+ [ L(x*(t),u*(t))dt + 0(x*(T),T) I (5) 


'to 


where to < t < T. Consider the optimal problem of the system starting from Xj in the 
time interval [zr, T] where r is a discrete time step. Dehne an incremental cost and an 
accumulative cost as: 


J r = E 


"(i+ 1 )t 


L(x.(t)u(t))dt 


( 6 ) 


Jt 


E 


0(x*(T),T) + 



L(x* (t), u* {t))dt 


( 7 ) 


where (x*(f), u*(t)) is the optimal solution pair over the time interval [(i + l)r, T]. Then, 
Bellman’s principle of optimality can be restated as: 


V{x.i,ir,T) = inf {J T + Jt} (8) 

ueu 

The incremental cost J T is the cost for the system to march one time step forward starting 
from a deterministic initial condition x, : . The system lands on an intermediate set of the state 
variables. The accumulative cost Jt is the cost for the system to reach the target set starting 
from this intermediate set, and is calculated through the accumulation of incremental costs 
over several short time intervals between (i + 1 )r and T . 

Bellman’s principle of optimality as stated in Equation (8) suggests that one can obtain a 
local solution of the optimal control problem over a short time interval r to form the global 
solution provided that certain continuity conditions on the solution are satisfied. In this 
paper, we shall impose such conditions in the probability sense. This is explained later in 
the paper. The global solution consists of all the intermediate solutions that are constructed 
backward in time starting from the terminal condition 0(x(T),T) at time T. 

3 THE SOLUTION APPROACH 
3.1 Background 

We need to evaluate the expected values in Equations (6) and (7) for a given control. For 
this we need the conditional probability density function p(x, r|x o ,0) which satisfies the 
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FPK equation associated with the process. For a given feedback control law u = f(x(f)), the 
response x(f) is a stationary Markov process [12], For a very small r, this density function 
p(x, t|x o ,0) is known to be approximately Gaussian within an error of order 0(r 2 ) [14], 

When p(x, r|x o ,0) is Gaussian, only the first and second order moments of x(f) need to 
be evaluated in order to completely specify the PDF. We can readily derive the moment 
equations from the Ito equation (2). When the system is nonlinear, the moment equations 
can be closed by applying the Gaussian closure method [12, 17], which is consistent with the 
short-time Gaussian approximation. 

The short-time conditional probability density function p(x, r|x 0 , 0) gives a probabilistic 
description of the system response x(t). The short-time solution has been implemented 
in two ways in the literature. The first one is the path integral [13, 19, 20], which treats 
the phase space as a continuum. The second approach is the cell mapping method, which 
discretizes the phase space into small regions, called cells. Since the path integral often has 
to be evaluated numerically, discretization of the phase space is inevitable. 

3.2 The backward search algorithm 

The backward solution process starts from the last segment over the time interval [T — r, T], 
Since the terminal condition for the last segment of the local solutions is specified, we can 
obtain a family of local optimal solutions for all possible initial conditions x(T — r). Then, 
repeat this process to obtain the next segment of the local optimal solution over the time 
interval [ T — 2 r, T] subject to the continuity condition at t = T — t. 

In general, the optimal control in the time interval [ir, T] is determined by minimizing the 
sum of the incremental cost and the accumulative cost in Equation (8) leading to R(xj,ir, T ) 
subject to the continuity condition x((i + l)r) = x* +1 where x* +1 is a set of initial conditions 
used in the problem with time interval [(i + 1 )r, T\. x* +1 is a random variable. The equality 
x((i+l)r) = x* +1 has to be interpreted in a probabilistic sense. In this paper, it is interpreted 
in the sense of maximum probability. Quantitatively, this is done as follows. 

In theory, the conditional PDF p(x, r|x o ,0) of a diffusion process even for a very short 
time r will cover the entire phase space. Let f2 be the extended target set to be defined later 
such that x* +1 e fl. For a given control, we define Pci as: 

Pn= [ p(x, r|xj, 0)dx (9) 

J xen 

Pci represents the probability that the system enters the extended target set 17 in time r 
when it starts at x, with probability one. The controlled response x(f) starting from a set of 
initial conditions x., will become a candidate for the optimal solution when Pci is maximal 
among all the initial conditions under consideration. 

Since we shall use a cellular structure of the phase space, we consider a finite region 
D C R n and discretize D into a countable number of small cells. Let U denote a countable 
set consisting of all the admissible controls Uj (i = 1,2, • • • ,/). We shall assume that the 
control is constant over each time interval [ir, (i + l)r]. Let f2 C R n denote the set of 
cells that form the target set defined by T r (x(T),T) = 0. As the backward solution steps 
proceed, the set f2 will be expanded. Assume that the terminal cost Jt is E[4>(x(T),T)]. 
The backward algorithm for searching the optimal control solutions is stated as follows: 
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1. Find all the cells that surround the target set 12. Denote the corresponding cell centers 
as z j. 

2. Construct the conditional probability density function p(x, t\zj, 0) for each control u* 
and for all cell centers z j. Let us call every combination (zj,u,) a candidate pair. 

3. Calculate the incremental cost J T (zj,Uj), the accumulative cost Jt( z£,u?) and P& for 
all candidate pairs, z * k is an image cell of z j in 12, and u? is the optimal control 
identified for z * k in previous iterations. 

4. Search for the candidate pairs that minimize J T (zj,Uj) + Jt(z£,u?) and satisfy Pn < 
Qmax{Pft}, where 0 < 0 < 1 is a factor set in advance. Denote such pairs as (z*, u*). 

5. Save the minimized accumulative cost function Jt( z*, u*) = J r ( z*, u*) + Jt(z£, u?) and 
the optimal pairs (z£,u?). 

6. Expand the target set 12 by including the cells z*. 

7. Repeat the search from Step (1) to Step (6) until all the cells in the entire region D 
are processed or until the original initial condition Xo of the optimal control problem 
is reached. 

As a result, the optimal control solution for all the cells covered by 12 is found. The 
choice of image cells, i.e. x* +1 , could certainly by biased. This however, is avoided by using 
(i) non-uniform integration times such that the growth of 12 is gradual, i.e. mapping most of 
the probability to neighboring cells, and (ii) by restricting the potential optimal pairs to be 
candidate pairs with high P^. These considerations led to the same global control solution 
regardless of the cell size [5]. 

The resulting dynamics of the conditional PDF is simulated using the Generalized Cell 
Mapping Method (GCM) [6]. Notice that if a simulation is done for a long time all probability 
will eventually leave D. This is a consequence of using a finite computational domain to 
model a diffusion process, therefore all numerical simulations face this problem. While 
some probability is leaking out from D, probability mass is also being brought back from 
its complement. In this study the computational domain is set such that the leakage of 
probability during the transient controlled response is very small. 

4 EXAMPLES 

4.1 State dependent diffusion terms 

Consider the dynamic system: 

x + ii{g + h)sgn(i;) + 2(x + u 2 x + ex A — f + u(t ), (10) 

where x(t) describes the horizontal sliding motion of a mass block placed on a moving 
foundation with rough contact surface and u(t) is a bounded force satisfying |u| < u — 1 
[15, 16]. C is th e viscous damping coefficient, /j is the dry friction damping coefficient, g 
is the gravitational acceleration, oo o is the natural frequency of the linear system and £ is 


5 



the nonlinear stiffness coefficient. Assume that / and v are correlated Gaussian white noise 
processes such that: 


E 


f 


= 0, E [h] = 0, E 


v(t)f(t') = 2D vf 5(t - t'), 


E[v(t)v(t')}=2D v 8(t-t'), E mm =2 D f 5(t-t'). 


Let X\ = x and a; 2 = x. The cost functional to be minimized is defined as: 


J 


E 


(a|x| 2 + f3u 2 )dt 


Ut 0 


( 11 ) 


The control objective is to drive the system from any arbitrary initial condition to the origin 
of the phase space while J is minimized. Following the rules of the Ito calculus, we convert 
Equation (10) into a set of stochastic differential equations (SDEs) in the Ito sense: 


dx i = x 2 dt, 

dx 2 = [— y 'igsgn(x 2 ) - 2(x 2 - u 2 xi - exf + /i 2 D v sgn(x 2 )sgn'(x 2 ) + fxD v fSgn.' (x 2 ) + u] dt 
+ [2n 2 D v sgn 2 (x 2 ) + AnD v f sgn(x 2 ) + 2 D f ] 1/2 d,B(t), (12) 

where B{t ) is a unit Wiener process satisfying E[B(t)] = 0, E[B{t)B{t')} — t — t' where 
t > t' . Details of the derivation of Equation (12) are given in Appendix A. An infinite 
hierarchy of moment equations for the state variables can be derived from the Ito equation. 
Defining m t] = E[x\xd 2 ], and applying the Gaussian closure method by using the expressions 
in Appendix C, we obtain a set of nonlinear differential equations for the first and second 
order moments: 


rh w = m 0 1 m 20 = 2 c 12 + 2m 10 m 10 

rhoi = ixgsgn(m 0 i)erf(\rn 0 i\/V2cr2) - 2Cm 0 i - u 2 m w - ern w {3al + m 2 10 ) 

+ gD v f(2/ V2Tr<7 2 ) exp(-^m 01 /a 2 ) 2 ) + u 

mu =o 2 2~ 2(ci2 - u 2 0 a\ - 3ea 2 (a 2 + m 2 w ) - fig^2jn(ci2/cr2) exp(-i(m 0 i/a 2 ) 2 ) 

- nD vf (ci 2 m 0 i/ \J2 Jtxo\) exp(-i(m 0 i/cr 2 ) 2 ) + m w m m + m w m m (13) 

rh 02 = — 4Ccr| - 2 u 2 ci 2 - 6eci 2 (o- 2 + m 2 0 ) - 2ngy/2/^a 2 exp(-i(m 0 i/a 2 ) 2 ) 

- nD vf y/2/n(m 0 i/a2) exp(--(m 0 i/o- 2 ) 2 ) + 2fi 2 D v + 2 D f + 2m m m m 
+ 4//D„/sgn (m 0 i ) erf ( | m 0 i | / V2a 2 ) 

where Ci 2 = mu — moimoi, erf = m 2 o — m\ 0 and cr| = m 02 — rn 2 n . Here, c ]2 is the 
covariance, of and cr 2 are the variance of x\ and x 2 respectively. 

The initial conditions required to integrate Equations (13) from t = 0 to t = r are 
specified by the coordinates of a cell center (xi,x 2 ), he. mi 0 (0) = Xi, m 0 i(0) = x 2 , m 20 ( 0) = 
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Xi, 777 . 02 ( 0 ) = xl and mn(0) = X 1 X 2 . After obtaining the time evolution of the first and 
second order moments, we construct a joint Gaussian probability density function (PDF) 
for (aq,£ 2 ) from time t — 0 to t — r. With this PDF, the cost functional and other response 
statistics over one time step r can be readily calculated. The system is parametrically and 
externally excited and diffusion term is a nonlinear function of the state. 

Numerical solutions and discussion 

We consider a region D = [—2,2] x [—2,2], It is discretized with 25 x 25 = 625 uniform 
cells. The parameters of the system are set as follows: /j = 0.05, Q = 0.1, o; 0 = 1, £ — 1, 
D v = 0.1, Df = 0.1, and D v f = 0. The Lagrangian of the cost function in Equation (11) 
is specified using a — /3 — 0.5 and the control set is uniformly discretized into 11 levels: 
u G {—1, —0.8, • • • ,1}. 

The reader must notice that there is a region on the aq-axis about the origin, where the 
mean trajectories of the uncontrolled response are trapped due to the effect of dry friction. 
In Equation (13), when the term /agsgn^moijer/Qmoil/v^oq) is dominant, a never ending 
sequence of changes in the sign of aq takes place. This term causes the response to switch 
indefinitely about the trapping region without having a net displacement. Figure 1 shows 
the vector field of the mean trajectory for the controlled response. The size of the arrows 
about the origin is enlarged to enable better observation. Jumps in the velocity are still 
present, but the trapping region on the aq-axis is not. These jumps show that the velocity 
just before and after reaching maximum elongation of the spring differ in magnitude. 

Now we evaluate the time evolution of the system response starting from a uniformly 
distributed initial condition. This initial condition allows us to study the control performance 
on the entire computational domain. The out crossing of any boundary of D is an irreversible 
process in the sense that D acts as a sink cell [10]. The reader should notice that such 
boundaries are leaking probability and eventually will absorb all of it. This however, is a 
mere consequence of having a bounded computational domain. In order to circumvent this 
difficulty, we define stationarity as the state in which the leakage of probability is stable and 
considerably small, e.g. both the probability of the system remaining in D and the expected 
value of the Lagrangian approach to a constant value with very small changes thereafter. 
Time evolutions starting from this pseudo-stationary state will be governed by the diffusion 
component of the dynamics. 

The stationary PDF of the uncontrolled system response is shown in Figure 2. Station- 
arity is reached after 10 time units when the 67% of the probability remains in D. Figure 3 
shows the stationary PDF for the controlled response. The controlled response keeps 82% 
of the probability in D and reaches stationarity in about 7 time units. In order to evaluate 
the control performance, the leakage of probability to the outside of D must be taken into 
consideration. We propose the following quantity to evaluate the control performance: 

Mt) = (14) 

-no >{t) 

where Po(t) is the probability of the system of being in D as defined in Equation (9). Figure 
4 shows the time evolutions of Jj$ and P© for both the uncontrolled and controlled system 
responses. By comparison we conclude that (i) the controlled response reaches the target 
with less cost, (ii) the stationary PDF of the controlled system is more highly concentrated 
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around the desired target set, (iii) a faster convergence to the stationary PDF takes place 
and (iv) a higher percentage of the probability Po(t) is kept in the domain D. 

4.2 Non-analytically closeable terms 

The closure methods can readily handle polynomials, ffowever, for other types of nonlin- 
earities the closure might not only require tedious and lengthy integrations [16], but also 
might lead to expressions that don’t admit a closed form. This problem prevents us from 
integrating the system of moment equations even numerically. In order to overcome this 
difficulty we make use of the cellular structure of the state space by approximating such 
nonlinearities with Taylor expansion. Once the approximations are available, the infinite 
hierarchy of moments can be readily closed. We illustrate this approach next. 

The example is concerned with the problem of driving a boat in a vortex held. The boat 
moves on the (aq,x 2 ) plane with a constant velocity relative to the water. Let the control u 
be the heading angle with respect to the positive aq-axis. The velocity held of the water is 
given by u(aq,aq) = ar/(br 2 + c) where r = \J x\ + x\ is the distance to the origin, i.e. the 
center of the vortex. The equation of motion of the boat is described by the SDE equations 
in the Stratonovich sense: 


aq = cos(w) — vx 2 /r + aim q ; (15) 

x 2 = sin (at) + uaq/r + avw 2 , 


where cc is a constant, w \ and w 2 are correlated Gaussian white noise processes with 
zero mean such that E [wi(t)wi(t')} = 2 Di5(t — t'), E [w 2 (t)w 2 (t')] = 2 D 2 S(t — t') and 
E [wi(t)w 2 (t')] = 2 D V2 5{t — t'). The control objective is to drive the boat from an initial 
condition to the target set defined by 4/ such that the cost function: 


J 


E 


Au(aq, x 2 )dt 


La t 0 


(16) 


is minimized. Here, we assume that the risk of navigating on the vortex is proportional to the 
tangential component of the vortex velocity. A is a scaling constant. The cost in Equation 
(16) represents the cumulative risk associated with the selected path. The derivation of the 
Ito equations is shown in Appendix B. 

Consider a special case when the white noise processes are uncorrelated. The Erst and 
second order moments of the state variables are derived as: 


/ \ ^ [%2V 

m w = cos(w) — E 

- r J 

\X\V 

m 0 1 = sm m + E 

- r . 

77120 = 277110 cos(ti) — 2 E 
r?io 2 = 2ttioi sin('u) + 2 E 


+ a 2 DiE 


+ olD 2 E 


dv 

dxi 

dv 

i 

dx 2 


~X\X 2 V~ 

+ 2a 2 D\E 

x i v dv 

r 

l 

~X\X 2 V " 

+ 2a 2 D 2 E 

x 2 v dv 

- r 

r dx 2 


+ 2a 2 DiE [u 2 ] 
+ 2 a 2 D 2 E [u 2 ] 


r x\v 


+ a 2 DiE 


x 2 v dv 
r dxi 


+ sin(w)77ii 0 + E 


x\v 


(17) 


+ a 2 D 2 E 


X\V dv 
r dx 2 


771n = COs(w)771oi — E 



Several expected values such as E \^y-\ in the above moment equations cannot be ana- 
lytically expressed by the lower order moments. That is to say, the Gaussian closure method 
cannot be used to analytically close the infinite hierarchy of the moment equations. 

Consider an initially delta PDF function to evolve over a short mapping time step r. The 
one step PDF function at t — r will still be concentrated in a small region with a mean not 
far away from the initial condition [14]. This fact suggests us to expand the non- analytically 
closeable functions about the initial condition, i.e. a cell center, in a Taylor series. The 
Taylor series provides an accurate representation of these nonlinear functions in a small 
neighborhood where the one step PDF function at t — r is defined. The polynomials of the 
Taylor series allow us to analytically express the higher order moments in terms of the lower 
order ones according to the Gaussian closure method. 

Let f(x 1 ,^ 2 ) denote a non-analytically closeable function such as — in Equation (17). 
A second order Taylor approximation f(xi,x 2 ) is given by: 

f(xi, x 2 ) = Pi + P 2 X 1 + P 3 X 2 + P±x\ + P 5 X 1 X 2 + (18) 

where the coefficients are calculated from the following conditions: 

f( X l, X 2 )\ z fix l;2q)|z! fxiX 2 fall X-f) |z fxiX 2 (*D) X 2 ) |zj 

fx2 (-Dj X 2 ) | z fx2 (X \ , X 2 ) | z , fxi(x l,X 2 )\ z f xi (xi , xf) | z , 

fx\X\ {xi,x 2 )\ z 

fx 1 X 1 {xi,x 2 )\ zi 

fx 2 X 2 {xi,X 2 )\ z 

fx 2 X 2 {xi,X 2 )\ z . 

In these equations, z stands for the state coordinates of the initial cell center and the 
subindices of the function refer to the partial derivative with respect to the state variable. 

The Taylor expansion is for each initial cell z, and is used only to integrate the moment 
equations over a short time interval (0, t). 

Numerical solutions and discussion 

The region D = [—2, 2] x [—2, 2] is discretized with 1089 square cells. A set of 30 evenly 
spaced angles values is taken as the control set, i.e. U = {— tt, — 147t/15, ..., 147t/ 15}. Notice 
that the set U represents an unbounded circular set of controls. The system parameters are 
chosen as follows: A = 1, a = 15, b — 10, c = 2, a = 1, — 0.05 and D 2 = 0.05. Let fl 

be the set of cells that form a discrete representation of the target set = {x\ = 2 , 2 : 2 }- El 
consists of the cells in the rightmost column of the discretized domain D. 

Figure 5 shows the mean vector field of the vortex. A trajectory of the boat freely moving 
in the vortex starting from x = (0, —1) is superimposed. The center of the vortex attracts 
probability due to the state dependence of the diffusion term and not to the existence of an 
attracting point at the origin. This behavior does not have a corresponding counterpart in 
the deterministic case. The time evolutions for E[L\ and the first and second order moments 
are shown in Figure 6. 

The vector field for the mean of the controlled response is shown in Figure 7. The 
trajectory of the boat under the optimal navigation starting from the same initial condition 
as in Figure 5 is superimposed. The corresponding time evolutions are shown in Figure 8. In 
the process, the control keeps the system in D with probability one. A discontinuity in the 
vector field exists in spite of having an unbounded control set. Such a discontinuity implies 
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a dichotomy in the long term behavior of the controlled response depending on the initial 
state of the system. 

While from the initial conditions above the discontinuity, the control will guide the boat 
against the velocity held of the vortex, from the initial conditions under, the control will 
move the boat in the direction of the current. Figure 9 shows the controlled response of a 
system starting from the deterministic initial condition x = (0.8, —0.32) after 3 time units. 
The two-peak density function occurs due to the discontinuity in the global control solution. 
Due to the diffusion effect of the random excitation, probability density functions that reach 
the discontinuity will bifurcate into two parts. 


4.3 Singular boundary conditions 

A vibro-impact system with a one-sided rigid barrier subject to a Gaussian white noise 
excitation is considered. The impact is assumed to be perfectly elastic. The control of 
the vibro-impact system is then subject to a state constraint given by the one-sided rigid 
barrier. Such a constraint makes the solution of optimal control problem very difficult to 
obtain analytically. 

A transformation of the state variables is used to effectively remove the non-smooth 
state constraint. The optimal control problem with fixed terminal conditions is accordingly 
transformed to the new domain and solved with the proposed method. Then, the control 
solution is transformed back to the physical domain, where it is evaluated using the GCM 
method. 

In the transformed domain the solution satisfies the constraints. However, the control 
constraint may be violated when the solution is transformed to the physical domain. For 
systems with optimal bang-bang solutions, the inverse transformation preserves the control 
bounds, and therefore all constraints are satisfied. 

The equation of motion for the vibro-impact system is given by: 

y + 9(y 2 - 1 )y + 2 (tty + G 2 y = u(t) + w(t), (19) 


subject to the impact condition: 


^ ft impact) y ft impact) yftimpact ) h, 

where ti mpac t is the time instant at which the impact occurs, y(t) is the displacement, w(t) 
is a Gaussian white noise process satisfying E[w(t)\ = 0, E[w(t)w(t + £)] = 2 D5(£) and u(t) 
is a bounded control satisfying u < it. 

Equation (11) with /3 = 0 is used here. The control objective is to drive the system from 
any arbitrary initial condition to the origin of the phase space, i.e. \E ' ( v ,y) = (0,0), while 
the energy of the system is minimized. The computational domain is chosen to be = 
[—h,a] x [— v,v\. As suggested in reference [7], we introduce the following transformation 
y — \x\ — h. This leads to the SDE for x : 

x + 9(x 2 — 2h\x\ + h 2 — l)x + 2C > Vtx + Vt 2 x = sgn (x)(u(t) + w(t) + M2 2 ). (20) 


Now, the transformed domain is given by = [— 2h, 2a] x [— v,v], the target set is = 

(±/i, 0), and the cost functional is given by: 


J x — E 



h) 2 + x 2 dt 


( 21 ) 
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Let X\ — x and X 2 = x. The Ito SDE of the system is given by: 

dx i = X2 dt (22) 

dx 2 = {—6x\x 2 + 20h\x\\x2 — (0T + 2(0,) X 2 — 0 2 x\ + sgn(xi)A) dt + sgn(x\)V2DdB 

where B(t) is a unit Wiener process, A = u + hO 2 and T = h 2 — 1. The first and second 
order moment equations of the state variables are given by: 

mi 0 = m 0 i, dr 20 = 2 m n 

rhoi = —0rri2i + 29hE[\xi\x2] — (9T + 2£fi) moi — fl 2 mio + i£[sgn(xi)]A$ 
mu = «ro2 — #777.31 + 2#AE , [|£i|£i£ 2 ] — (6T + 2(0) mu — 0 2 m 20 + ^[xiSgn(xi)]A (23) 

m 0 2 = ~26m 2 2 + 4:9hE[\xi\x 2 ] — 2 (9T + 2^12) m 0 2 — 2fi 2 ran + 2E , [x 2 sgn(x 1 )] A (24) 

+ 2DE[sgn(x 1 ) 2 ] 

Formulas in Appendix C can be used to close the above moment equations according 
to the Gaussian closure method. Next, we apply the current method and solve for optimal 
controls in the state domain B x with the cost function J x and the target set After 

obtaining the global optimal control solution u*{x i,x 2 ), we transform it back to the original 
domain leading to the optimal control of the original system u*(y, y). Notice that in general, 
this process leads to controls that don’t preserve the control constraints in B y 

The optimal control in both domains is bang-bang. The bang-bang control is fully de- 
termined by the switching curves in the phase space. From the solution u*(x i,x 2 ), we find 
numerical approximations of such curves and then we transform them back to the physical 
state space. 

Numerical solutions and discussion 

In the numerical example, we have chosen: u — 1, h = —1, a — 1, v — 1, 6 — 0, ( — 0, 
0 = 1 and D = 0.1. The transformed state space is IDL = [—2,2] x [—2,2], 849 square cells 
are used to discretize B x . The transformed target set is formed by the cells that contain the 
points '&{ x ,x) = (±1,0). Since the optimal control is bang-bang, we have U = {—1, 1}. 

The uncontrolled response of the system is marginally stable about the origin since 6 = 0 
and ( = 0. When the system impacts at y = h, it is reflected back in such a way that y 
remains the same and the sign of y is reversed. The crossing of any other boundary is an 
irreversible process in the sense that acts as a sink cell [10] . 

To demonstrate the behavior of the uncontrolled system, we consider an uniformly dis- 
tributed initial condition in B y = [—0.86, —0.78] x [—0.64, —0.57]. The PDF of the response 
after 3 time units is shown in Figure 10. At this time, only 22% of the probability remains in- 
side of B^ indicating the strong effect of diffusion on the marginally stable system. It should 
be noted that even the probability that remains inside B^ does not concentrate around the 
target. Time evolutions of the relevant responses are shown in Figure 11. 

The global optimal control solution obtained in the domain B x is shown in Figure 12, 
which indicates the existence of switching curves of the bang-bang optimal control. Numeri- 
cal solutions of the switching curves are obtained with the curve fitting method by using this 
control solution. These curves are superimposed in Figure 12. The switching curves as well 
as the control solutions are mapped back to the physical domain B y as shown in Figure 13. 
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It is interesting to point out the qualitative differences in the controlled response for states 
in the third quadrant of O y . While for some states, marked with crosses, the control speeds 
up the system favoring impact in others avoids it. 

Consider now the same uniformly distributed initial condition in B y . The controlled 
system is found to converge to the steady state PDF shown in Figure 14 in 4 time units 
with 98% of the probability remaining in B y . As discussed in the previous example, the 
switching curves in this example will split a uni-modal PDF of the response into a bi-modal 
one during the transient. In particular, in the third quadrant of D y just before impact, part 
of the probability flow speeds up while the other slows down, thus splitting the PDF. The 
time evolutions of the cost and moments of the response are shown in Figure 15. As before, 
the controlled response (i) converges to the target set with high probability, (ii) minimizes 
the cost and (iii) maximizes the probability of staying in the computational domain. 

5 CONCLUSIONS 

This paper presents a study on the optimal control problem of nonlinear stochastic systems 
using Bellman’s principle of optimality, the short-time Gaussian approximation and the 
Generalized Cell Mapping. Control problems of several challenging nonlinear systems with 
fixed final state terminal conditions subject to state and control constraints are studied 
to demonstrate the effectiveness of the approach. In particular, nonlinear systems with 
a state dependent diffusion part and non-analytically closeable terms are considered. A 
novel strategy based on multiple local Taylor expansions is proposed to handle the the non- 
closeablc terms of the differential equations for the moments of the state variables so that the 
Gaussian closure method can be applied. The optimal control of a nonlinear vibro-impact 
oscillator is studied by using transformation of variables. In the process, the dynamics, the 
cost functional, the admissible state space, and the target set are transformed to a new 
domain. Once the global feedback solution is obtained, we transform it back to the physical 
domain where it is evaluated using GCM. In all numerical examples, the method leads to 
controls with excellent performance. Computational improvements and extensions to higher 
dimensional processes deserve further investigation. A rigorous study of convergence and 
stability of the method is still elusive at this time, and should be pursued in the future. 
Since the solution of optimal control problems of non-linear stochastic systems are difficult 
to obtain and rare in the literature, this investigation offers a valuable contribution to the 
held. 
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Appendix A 


Details on the derivation of the equations for the problem studied in Section 4.1 are 
presented here. From Equation (10), we identify: 


0 0 
1 -/isgn(x 2 ) 


Q =2 


D f D vf 
D vf D v 


(25) 


The Wong-Zakai correction term is given by Am = [0, Am] T , where 

1 d 1 , 

Am = - 02 S Q S 2 q^v ?2 - - (2D,, + 2D,fisgii(x 2 )) fisgn'(x 2 ) 


(26) 


Repeated indices are used to indicate summation. The diffusion term in the FPK equation 
is given by: 


crQa T 


0 0 
0 2Dj + 4//D ;l/ se;ii!x 2 j + 2^ 2 D.,sgn' 2 (x 2 )) 


(27) 


An equivalent single Wiener process leading to the same FPK equation has the diffusion 
term a e = [0,cr e ] T with a e = (2Df + AfiD v fSgn(x 2 ) + 2fi 2 D v sgn 2 (x2))) 1 ^ 2 ■ Hence, the Ito 
SDE is given by dx(f) = (m(x, u) + Am) dt + cr e dB(t), which is Equation (12). 


Appendix B 


Details on the derivation of the equations for the problem studied in Section 4.2 are 
presented here. From Equation (15) we fold 


a = a 


v 0 
0 v 


Q =2 


D] D V2 
D\2 D 2 


(28) 


where v is an arbitrary function of the state variables. The Wong-Zakai correction term is 
given by A rrij = \(?ksQsi-^Vji- After some manipulations we obtain: 


Ami 

A m 2 


a 2 v(Di 

a 2 v(D 2 


dv 

dxi 

dv 

dx 2 


+ D 12 

+ Di 2 


dv , 
dx 2 
dv . 
dx i 


( 29 ) 
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The diffusion term in the FPK equation is given by cQc 1 = 2ct 2 u 2 Q. An equivalent Wiener 
process leading to the same FPK equation has the diffusion terms: 


&ei — 
°e2 — 

Hence, the Ito SDE is given by: 
dx i = ^cos(w) 


dv 2 


2a 2 \^Div 2 + D 12 / ^—dxi 
2a 2 ( D- 2 V 2 + D \2 


dx 2 

dv 2 


dx\ 


dx- 


1/2 

1/2 


(30) 


x 2 v i 2 - — dv „ dv 


_ + a u(Z?i— h D\2 w — ) ) dt + a e idBi 

\ /XI+X2 1 cjx '^ 


(31) 


d;r 2 = sin(rt) + 


aqu , 2 , „ dv „ dv 


+ a u(D 2 — + D 12 — ) 1 dt + a e2 dH 2 


For uncorrelated white noise processes, the moment equations are: 

dv 


m 10 = cos(w) — A 
rhoi = sin(w) + A 


x 2 u 
. r 

XiU' 


r 


+ a DiE 


+ a 2 D 2 E 


dx\ 

dv 


dx 2 


rhu = cos(w)moi — E 


r x 2 v 


+ a 2 DiE 


x 2 v dv 
r dxi 


+ sin(u)mio + E 


r x 2 v 


+ O' D2E 


X\V dv 
r dx 2 
(32)' 


m 20 = 2mi 0 cos(u) — 2E 
rh 0 2 = 2m 0 i sin(w) + 2E 1 


'XIX 2 V 

+ 2a 2 DiE 

x 1 v dv 

. r 

r dx 1 

'X 1 X 2 V 

+ 2 a 2 D 2 E 

x 2 v dv 

. r 

r dx 2 


+ 2 a 2 DiE [u 2 ] 
+ 2 a 2 D 2 E [u 2 ] 


In the example, we used Taylor expansions for the functions v and vj \J x\ + x 2 up to the 
second order. After substituting the polynomial approximations into Equation (32), the 
expected values of the polynomial functions in terms of lower order moments are obtained. 

Appendix C 

The standard joint Gaussian conditional probability density function of x\ and x 2 is given 


by: 


p(xi,x 2 ,T\x m ,X2{ 0 ),0) = 


: exp 


— af — 2 paia 2 + a 2 
2(1 -P 2 ) 


(33) 


27r<7i<7 2 a/(1 - p 2 ) 

where p = m . = E[xi\, a 2 = E[(xi - m* ) 2 ], cq = for i = 1,2. 

In these expressions, m; and a, and p are the means, standard variations and correlation 
coefficient of the density function evaluated at time t — r when the system starts from 
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the deterministic initial condition x = (xi,X 2 ) t ■ Notice that for real variables, p always lies 
between —1 and 1 . Recall that expected values of an arbitrary function of x\ and x 2 using 
the Gaussian closure, can be computed from: 


r»oo /»oo 


E\f{x^x 2 )] = 


r —OO J — OO 


f(x i,x 2 ) 


27ro- 1( T2^(l - p 2 ) 
The following expressions have been found. 


exp 


o\ — ‘2pa,\a,2 + a 2 
2(1 - p 2 ) 


dxidx 2 


^[sgn(xi)] = sgn ( ) erf f -^= ) £ [sgn' (ay)] = — %= exp 


a. 


V 2 


a , 


V 2 


a 


\/ 2 n 


mt 


2 at 


E[xjSgn(xi)} = = exp 


1717 


+ rrijE[sgn.(xi)] £'[sgn(xj)sgn / (xj)] =0 


"V 2 al 

E[(xj - mj)E[sgn.(xi)]\ = 0 , E[(xj - rrij)x?\ = 3 cr 2 (cr 2 + m 2 ) 


pr ,/ \-i —rriipaia 2 

E[Xj sgn (ay)J = — q 7^= exp 


cr; 




mj 


2 erf 


+ mj^fsgn^Xj)] £[sgn(ay) 2 ] = 1 


E[xjSgn(xi)sgn (xi)] = 0 £[aysgn(ay)[ = 


2 o-j 

y/ 2 ^ 


exp 


rri 7 


2 er 2 


+ m.iE[sgn(xi) 


7 ? [ay sgn (ay) sgn' (ay)] = 0, E[(xj - m^xf) = 3 paia 2 (cr 2 + m 2 ) 
£[aysgn'(ay)] = 0 , E[xf] = mj( 3 cr 2 + m 2 ) 


( 34 ) 


( 35 ) 

( 36 ) 

( 37 ) 

( 38 ) 

( 39 ) 

( 40 ) 

( 41 ) 
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Figure 1: Vector field of the expected trajectories of the controlled response. 



Figure 2: Contours of the stationary PDF of the uncontrolled response of the mass block 
system with dry friction damping. 
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Figure 3: Contours of the stationary PDF for the controlled response of the mass block system 
with dry friction damping. 




Figure 4: Time evolutions of the scaled cost Jo and the probability Po of the system with dry 
friction damping. Uncontrolled: ( — ) and controlled: ( ). 
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Figure 5: Expected vector field of the uncontrolled trajectories and a single trajectory starting 
from x = (0,-1). 



r f s f f (■ 

/ / / / f S / T 
.■-■ .-' .■-■ .■ J ,. J .■* .■•' .■■' / 

.-- .s' s' S S f s' ..■■ .■•. 

_■*“ y y y y y r 
-- y 

i.-_- y y y / / y r y .• 



Figure 6: Time evolutions of E[L\ and the first two order moments of the state variables of 
the uncontrolled trajectories of the boat navigating in the vortex field. Dashed lines: moments 
of x i. Continuous lines: moments of x 2 . 
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Figure 7: Expected vector field of the controlled trajectories and single controlled trajectory 
starting from x = (0, —1). The target cells are marked with crosses. 



Time 


Figure 8: Time evolutions of E[L\ and the first two order moments of the state variables of 
the controlled trajectories of the boat navigating in the vortex field. Dashed lines: moments 
of x i. Continuous lines: moments of x 2 . 
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Figure 9: The PDF of the controlled trajectories of the boat after 0.3 time units for the initial 
condition x(0) = (0.8, —0.32). The initial condition is close to the discontinuity in optimal 
controls. 



Figure 10: The PDF of the uncontrolled response of the system with an impact barrier at 
y = — 1 after 3 time units. The system starts from the initial condition y = [—0.7, —0.7]. 
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Figure 11: Time evolutions of Po ( ), Jo (—) and the moments m w (— ■ —), m 0 1 ( — ), 

m 2 o (— ■ —), and m 0 2 ( — ) of the uncontrolled response of the vibro-impact system. 



Figure 12: Global optimal control solution of the vibro-impact system in the transformed 
domain O x . Cells marked with circles denote the regions where the optimal control is u* = u, 
and cells marked with crosses represent the regions where the optimal control is u* = —u. 
The continuous lines represent the switching curves 


22 



Figure 13: Global optimal control solutio7i of the vibro-impact system in the original domain 
D y . The legends are the same as in Figure 12. 



Figure 14: Stationary PDF of the controlled response of the vibro-impact system. 
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